Post-Training Embedding Binarization for Fast Online Top-K Passage Matching

5.11.2 Gradient Estimation

After obtaining the semantic-diffused embedding bag, a rescaled binarization of each embedding in the contextualized bag is constructed as:

$$
B_{q_i} = \frac{\|\hat{E}_{q_i}\|_1}{c} \cdot \operatorname{sign}(\hat{E}_{q_i}), \tag{5.47}
$$

where $i \in [|\hat{E}_q|]$ indexes the embeddings in the bag and $c$ denotes the embedding dimension. For instance, with $c = 4$ and $\hat{E}_{q_i} = (0.5, -1.2, 0.1, 2.2)$, the scaler is $\|\hat{E}_{q_i}\|_1 / c = 4.0 / 4 = 1.0$ and thus $B_{q_i} = (1, -1, 1, 1)$. The binarized embedding bag $B_q$ sketches the original embeddings via (1) binarized codes and (2) an embedding scaler, which together reveal the value range of the original embedding entries. Moreover, such rescaled binarization supports bit-wise operations for computation acceleration in match-scoring prediction. However, $\operatorname{sign}(\cdot)$ is non-differentiable at zero and has zero gradient everywhere else, which blocks gradient backpropagation through the binarization step. To mitigate this, the authors further utilize the approximation of the unit impulse function [58] to furnish an accordant gradient estimation:

$$
\frac{\partial \mu(t)}{\partial t} =
\begin{cases}
1, & t = 0, \\
0, & \text{otherwise}.
\end{cases} \tag{5.48}
$$

Observing the translation $\operatorname{sign}(t) = 2\mu(t) - 1$, it follows that $\frac{\partial \operatorname{sign}(t)}{\partial t} = 2\,\frac{\partial \mu(t)}{\partial t}$. Furthermore, $\frac{\partial \mu(t)}{\partial t}$ can be expressed as the limit of a zero-centered Gaussian probability density function:

$$
\frac{\partial \mu(t)}{\partial t} = \lim_{\beta \to \infty} \frac{|\beta|}{\sqrt{\pi}} \exp\!\big(-(\beta t)^2\big), \tag{5.49}
$$

which, truncating the limit at a finite bandwidth $\gamma$, yields the estimator:

$$
\frac{\partial \operatorname{sign}(t)}{\partial t} \approx \frac{2\gamma}{\sqrt{\pi}} \exp\!\big(-(\gamma t)^2\big). \tag{5.50}
$$

Intuitively, the estimator in Eq. (5.50) follows the main direction of the true gradients of $\operatorname{sign}(\cdot)$, which produces a coordinated embedding optimization for inputs with diverse value ranges.
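To make the forward binarization of Eq. (5.47) and the backward estimator of Eq. (5.50) concrete, below is a minimal PyTorch sketch of a custom autograd function. The class name RescaledBinarize, the gamma argument, and the choice to treat the $\ell_1$ scaler as a constant in the backward pass are our assumptions for illustration, not details from the paper.

```python
import math
import torch

class RescaledBinarize(torch.autograd.Function):
    """Forward: rescaled binarization of Eq. (5.47).
    Backward: Gaussian gradient estimator of Eq. (5.50).
    A minimal sketch; the l1 scaler is treated as a constant (our assumption)."""

    @staticmethod
    def forward(ctx, e, gamma):
        c = e.shape[-1]                                  # embedding dimension
        scaler = e.abs().sum(dim=-1, keepdim=True) / c   # ||E||_1 / c
        ctx.save_for_backward(e, scaler)
        ctx.gamma = gamma
        return scaler * torch.sign(e)                    # Eq. (5.47)

    @staticmethod
    def backward(ctx, grad_out):
        e, scaler = ctx.saved_tensors
        g = ctx.gamma
        # Eq. (5.50): d sign(t)/dt ≈ (2*gamma / sqrt(pi)) * exp(-(gamma*t)^2)
        d_sign = 2.0 * g / math.sqrt(math.pi) * torch.exp(-(g * e) ** 2)
        return grad_out * scaler * d_sign, None          # no gradient for gamma

# Usage: binarize a bag of 32 contextualized embeddings of dimension 128.
emb = torch.randn(32, 128, requires_grad=True)
binarized = RescaledBinarize.apply(emb, 10.0)
binarized.sum().backward()                               # gradients flow via Eq. (5.50)
```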

Similar to ColBERT [113], Bi-ColBERT employs the proposed late interaction mechanism for matching-score computation, implemented as a sum of maximum-similarity embedding dot-products:

$$
S_{q,d} = \sum_{i \in [|B_q|]} \max_{j \in [|B_d|]} B_{q_i} \cdot B_{d_j}^{T}, \tag{5.51}
$$

which can be equivalently implemented with bit-wise operations as follows:

$$
S_{q,d} = \sum_{i \in [|B_q|]} \max_{j \in [|B_d|]} \alpha_{q_i}\, \alpha_{d_j} \cdot \operatorname{count}\!\big(\operatorname{xnor}(\operatorname{sign}(B_{q_i}), \operatorname{sign}(B_{d_j}))\big), \tag{5.52}
$$

where $\alpha_{q_i} = \|\hat{E}_{q_i}\|_1 / c$ and $\alpha_{d_j} = \|\hat{E}_{d_j}\|_1 / c$ are the embedding scalers from Eq. (5.47), $\operatorname{xnor}(\cdot, \cdot)$ denotes the bit-wise XNOR, and $\operatorname{count}(\cdot)$ counts the set bits (a population count).

The above equation replaces most floating-point arithmetic with bit-wise operations, offering the potential for online computation acceleration. Lastly, Bi-ColBERT adopts the training paradigm of ColBERT, optimizing a pairwise softmax cross-entropy loss over the computed scores of positive and negative passage samples.
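To see how the bit-wise formulation accelerates the scoring of Eq. (5.51), the sketch below packs the sign codes into integers and scores them with XNOR plus a population count, using the identity that for $c$-dimensional $\pm 1$ vectors the dot product equals $2 \cdot \operatorname{popcount}(\operatorname{xnor}(a, b)) - c$. The helper names (pack_bits, dot_via_xnor, late_interaction_score) are our own, and whether Eq. (5.52) applies this affine correction or the raw bit count is not spelled out in this excerpt.

```python
import random

C = 128  # embedding dimension = number of bits per token code

def pack_bits(sign_vec):
    """Pack a +/-1 vector into an integer: bit k is set iff sign_vec[k] == +1."""
    bits = 0
    for k, s in enumerate(sign_vec):
        if s > 0:
            bits |= 1 << k
    return bits

def dot_via_xnor(a_bits, b_bits, c=C):
    """For c-dimensional +/-1 vectors: a . b = 2 * popcount(xnor(a, b)) - c."""
    mask = (1 << c) - 1
    matches = bin((a_bits ^ b_bits) ^ mask).count("1")  # popcount of the XNOR
    return 2 * matches - c

def late_interaction_score(q_codes, q_scalers, d_codes, d_scalers):
    """Eqs. (5.51)/(5.52): for each query token, take the best-matching
    document token, then sum the rescaled maximum similarities."""
    return sum(
        max(aq * ad * dot_via_xnor(qb, db) for db, ad in zip(d_codes, d_scalers))
        for qb, aq in zip(q_codes, q_scalers)
    )

# Usage: 4 query tokens against 8 document tokens with random sign codes.
rand_code = lambda: pack_bits([random.choice((-1, 1)) for _ in range(C)])
q_codes = [rand_code() for _ in range(4)]
d_codes = [rand_code() for _ in range(8)]
print(late_interaction_score(q_codes, [1.0] * 4, d_codes, [1.0] * 8))
```

In a production system the population count would map to a single hardware instruction over packed machine words rather than Python integers, which is where the acceleration over floating-point dot products comes from.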

The proposed Bi-ColBERT is evaluated on the MS-MARCO Ranking dataset [182], a collection of 8.8M passages drawn from 1M real-world Bing queries. Each query is associated with sparse relevance judgments: one (or a small number of) documents marked as relevant and no documents explicitly marked as irrelevant. The results listed in Table 5.8 suggest a trade-off between passage-searching quality and retrieval cost, where ColBERT aims to simplify the neural architecture and Bi-ColBERT focuses on effective embedding binarization.